This project helps an online greeting card company understand the life-time value of its customers. The company wants to improve its revenue by identifying loyal customers and finding ways to retain them, and also wants to separate inactive users from active ones, which requires a customer segmentation scheme. Therefore, this project solves the following problems:
a). A model to predict whether a customer will cancel their subscription in the near future
b). A model to estimate the life-time-value for a customer
c). A customer segmentation scheme to identify inactive vs active users
The client provided 403,835 daily usage records for 10,000 customers, spanning January 1st, 2011 to December 31st, 2014. The dataset includes several statistics of each customer’s behavior when visiting the website, as follows:
| Data Field | Description |
|---|---|
| id | A unique user identifier |
| status | Subscription status: ‘0’ - new, ‘1’ - open, ‘2’ - cancellation |
| gender | User gender: ‘M’ - male, ‘F’ - female |
| date | Date on which user ‘id’ logged into the site |
| pages | Number of pages visited by user ‘id’ on date ‘date’ |
| onsite | Number of minutes spent on site by user ‘id’ on date ‘date’ |
| entered | Flag indicating whether or not user ‘id’ entered the send-order path on date ‘date’ |
| completed | Flag indicating whether the user completed the order (sent an eCard) |
| holiday | Flag indicating whether at least one completed order included a holiday-themed card |
## 'data.frame': 403835 obs. of 9 variables:
## $ id : int 1 1 1 1 1 1 2 2 2 2 ...
## $ status : int 0 1 1 1 1 2 0 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 1 1 1 ...
## $ date : Factor w/ 1461 levels "2011-01-01","2011-01-02",..: 1400 1401 1402 1445 1449 1453 981 982 983 985 ...
## $ pages : int 7 6 6 1 1 0 7 9 7 8 ...
## $ onsite : int 3 8 20 1 1 0 23 30 19 3 ...
## $ entered : int 1 0 1 0 0 0 1 1 1 1 ...
## $ completed: int 1 0 0 0 0 0 1 1 1 0 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## id status gender date
## Min. : 1 Min. :0.0000 F:358716 2013-12-24: 1132
## 1st Qu.: 2537 1st Qu.:1.0000 M: 45119 2012-12-24: 954
## Median : 5025 Median :1.0000 2013-12-17: 821
## Mean : 5017 Mean :0.9909 2013-12-18: 808
## 3rd Qu.: 7495 3rd Qu.:1.0000 2013-10-27: 801
## Max. :10000 Max. :2.0000 2013-12-23: 789
## (Other) :398530
## pages onsite entered completed
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 5.000 Median : 5.000 Median :1.0000 Median :1.0000
## Mean : 5.018 Mean : 8.831 Mean :0.7821 Mean :0.5627
## 3rd Qu.: 7.000 3rd Qu.: 11.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :10.000 Max. :220.000 Max. :1.0000 Max. :1.0000
##
## holiday
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2277
## 3rd Qu.:0.0000
## Max. :1.0000
##
Initial exploration of the data shows:

- No missing values (NA) are present in the data
- No discrepancies in the collected observations
- Female users far outnumber male users (358,716 records vs. 45,119)
- The data requires no further cleaning
Since the usage statistics provided by the client are at the daily level, we first aggregate the data to the customer level and create additional features that may help in exploring customers’ life-time value. When calculating duration-related features, we assume that the last active date for customers who have not cancelled their accounts is December 31, 2014, the last day of the effective data range of our dataset.
The description of each feature is as follows:

| Feature | Description |
|---|---|
| gender | User gender: ‘M’ - male, ‘F’ - female |
| totalLoginNum | Total number of logins over the customer’s tenure with the company |
| avgPages | Average number of pages visited per login over the customer’s tenure |
| maxPages | Maximum number of pages visited in a single login over the customer’s tenure |
| minPages | Minimum number of pages visited in a single login over the customer’s tenure |
| sdPages | Standard deviation of the number of pages visited over the customer’s tenure |
| skewPages | Skewness of the number of pages visited over the customer’s tenure |
| avgOnsite | Average number of minutes spent on site per login over the customer’s tenure |
| maxOnsite | Maximum number of minutes spent on site in a single login over the customer’s tenure |
| minOnsite | Minimum number of minutes spent on site in a single login over the customer’s tenure |
| sdOnsite | Standard deviation of the number of minutes spent on site over the customer’s tenure |
| skewOnsite | Skewness of the number of minutes spent on site over the customer’s tenure |
| avgOnsitePageTime | Average number of minutes the customer spends on one page |
| maxOnsitePageTime | Maximum number of minutes the customer spends on one page |
| minOnsitePageTime | Minimum number of minutes the customer spends on one page |
| sdOnsitePageTime | Standard deviation of the number of minutes the customer spends on one page |
| skewOnsitePageTime | Skewness of the number of minutes the customer spends on one page |
| avgEntered | Average number of entered orders of the customer |
| sumEntered | Total number of entered orders of the customer |
| avgCompleted | Average number of completed orders of the customer |
| sumCompleted | Total number of completed orders of the customer |
| cmplOverEtr | Ratio of the customer’s total completed orders to total entered orders |
| YesHoliday | Total number of completed orders flagged as holiday orders |
| NoHoliday | Total number of completed orders not flagged as holiday orders |
| conversion | Average conversion rate from page visit to order entry for the customer |
| firstMonthLoginNum | Number of logins during the first month of the customer’s subscription |
| lastMonthLoginNum | Number of logins during the last month of the customer’s subscription (for customers who have not cancelled, the last month is December 2014) |
| oneMonthLoginRatio | Ratio of firstMonthLoginNum to lastMonthLoginNum |
| Q1 | Number of logins during first quarters (January to March of each year) of the customer’s subscription |
| Q2 | Number of logins during second quarters (April to June of each year) of the customer’s subscription |
| Q3 | Number of logins during third quarters (July to September of each year) of the customer’s subscription |
| Q4 | Number of logins during fourth quarters (October to December of each year) of the customer’s subscription |
| avgDateDiff | Average time between adjacent logins of the customer |
| maxDateDiff | Maximum time between adjacent logins of the customer |
| minDateDiff | Minimum time between adjacent logins of the customer |
| sdDateDiff | Standard deviation of the time between adjacent logins of the customer |
With these features in place, we turn to the three problems. In the following sections, we walk through the features and the predictive or descriptive methods fitted for each problem, compare the performance of the models we tried, and summarize our findings. Before fitting each model, we split the data into a training set (80% of all customer data) and a test set (the remaining 20%), train the models on the training set, and compare them by their performance on the test set.
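The 80/20 customer-level split described above can be sketched with scikit-learn (the original work was done in R; the feature matrix and labels below are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder cancellation labels

# 80% training, 20% test, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))     # 80 training rows, 20 test rows
```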
This is a classification problem; therefore, we consider Logistic Regression with lasso, Random Forest, Naive Bayes, and K-Nearest Neighbors.
These methods were considered because this is a classic classification problem. Each model has advantages and disadvantages, which we briefly review before discussing model selection:
Random Forest: a highly accurate and flexible learning algorithm that also reports variable importance. Its main disadvantage is that it is hard to interpret.
Regularized Lasso: automatically selects the best variables, but has a hard time detecting interaction terms and assumes the model is linear.
Naive Bayes: scales well to problems with a large number of predictors. The downside is that it assumes all inputs are independent within each class (a problem if the variables are collinear).
K-NN: highly flexible, but hard to interpret.
To select a model, we used K-fold cross-validation to estimate the overall accuracy on the training set. We then weighed each model's interpretability and flexibility against the business goal of accurately identifying the customers who will cancel their subscription.
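The K-fold cross-validation used for model selection can be sketched as follows. This is a scikit-learn stand-in for the R workflow; the synthetic data, the 5 folds, and the L1-penalized logistic regression settings are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the customer-level training data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# L1 penalty mimics the lasso-style variable selection discussed above.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# 5-fold cross-validated accuracy on the training set.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```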
Lasso automatically performs feature selection when fitting a regularized logistic regression, at the cost of assuming a linear model and struggling to detect interaction terms. We choose the \(\lambda\) value using cross-validation, and the CV error plot is as below:
From the plot above, we choose the optimal \(\lambda\) = 0.0036 using the 1-SE rule; the resulting model selects 24 variables. The variables and their coefficients are in the table below:
| Variable | Coefficient |
|---|---|
| (Intercept) | 5.593 |
| genderM | -0.946 |
| avgPages | -5.984 |
| maxPages | 1.377 |
| minPages | 1.322 |
| sdPages | 3.551 |
| skewPages | -7.423 |
| maxOnsite | 0.001 |
| avgOnsitePageTime | -0.006 |
| minOnsitePageTime | -1.535 |
| skewOnsitePageTime | 0.074 |
| avgEntered | -1.373 |
| avgCompleted | -2.575 |
| YesHoliday | 0.020 |
| conversion | 5.567 |
| firstMonthLoginNum | 0.172 |
| lastMonthLoginNum | 0.498 |
| oneMonthLoginRatio | 3.413 |
| Q1 | 0.021 |
| Q2 | 0.011 |
| Q3 | 0.002 |
| Q4 | -0.003 |
| avgDateDiff | 0.044 |
| maxDateDiff | 0.007 |
| minDateDiff | -0.832 |
Then we test the model on the test data, and get the confusion matrix below.
## Observation
## Prediction 0 1
## 0 599 92
## 1 129 1178
The misclassification rate on test data is 11.06%.
## result
## accuracy 0.8893894
## sensitivity 0.9275591
## specificity 0.8228022
## ppv 0.9013007
## npv 0.8668596
## precision 0.9013007
## recall 0.9275591
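The reported metrics can be recomputed directly from the confusion matrix above, treating class 1 (cancellation) as the positive class. `clf_metrics` is a helper name introduced here for illustration; the layout matches the printout (rows are predictions, columns are observations):

```python
def clf_metrics(tn, fn, fp, tp):
    """tn: pred 0/obs 0, fn: pred 0/obs 1, fp: pred 1/obs 0, tp: pred 1/obs 1."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on cancelled customers
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),   # precision
        "npv":         tn / (tn + fn),
    }

# Counts from the lasso logistic regression confusion matrix above.
m = clf_metrics(tn=599, fn=92, fp=129, tp=1178)
print({k: round(v, 4) for k, v in m.items()})
# accuracy 0.8894, sensitivity 0.9276, specificity 0.8228, ppv 0.9013, npv 0.8669
```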
Random forests are a highly accurate and flexible learning algorithm that also reports variable importance, though the resulting model is hard to interpret. Random forests improve over bagged trees via a small tweak that decorrelates the trees: we build a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, only a random sample of m predictors is chosen as split candidates from the full set of p predictors. We fit random forests to the training data, and the importance of the different variables is as below:
Then we test the random forests on the test set, and get the confusion matrix below:
## Observation
## Prediction 0 1
## 0 539 69
## 1 189 1201
The misclassification rate on test data is 12.91%.
## result
## accuracy 0.8708709
## sensitivity 0.9456693
## specificity 0.7403846
## ppv 0.8640288
## npv 0.8865132
## precision 0.8640288
## recall 0.9456693
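The random-forest construction described earlier (m of the p predictors considered at each split) can be sketched with scikit-learn as a stand-in for the R fit; the synthetic data and hyperparameters below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the customer-level training data.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # number of bootstrapped trees
    max_features="sqrt",    # m ~ sqrt(p), the usual classification default
    random_state=0,
).fit(X, y)

# Variable importance, one of the model's main diagnostics.
print(sorted(rf.feature_importances_, reverse=True)[:3])
```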
Naive Bayes classifiers are a family of simple probabilistic classifiers that apply Bayes’ theorem with strong independence assumptions between the features. They scale well to problems with a large number of predictors, but assume all inputs are independent within each class, which can be a problem if the variables are collinear. Although our features may not be fully independent given the class label, Naive Bayes often performs well on large feature spaces, so we decided to try it. We fit Naive Bayes to the training data, then test it on the test set and get the confusion matrix below:
## Observation
## Prediction 0 1
## 0 510 182
## 1 218 1088
The misclassification rate on test data is 20.02%.
## result
## accuracy 0.7997998
## sensitivity 0.8566929
## specificity 0.7005495
## ppv 0.8330781
## npv 0.7369942
## precision 0.8330781
## recall 0.8566929
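The Naive Bayes fit can be sketched with scikit-learn's Gaussian variant as a stand-in for the R fit (the synthetic data and the Gaussian class-conditional assumption are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the customer-level data, split 80/20.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=1)

# Class-conditional independence lets the joint likelihood factor into
# a product of per-feature Gaussian densities.
nb = GaussianNB().fit(Xtr, ytr)
print(nb.score(Xte, yte))   # test-set accuracy
```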
KNN is highly flexible, but hard to interpret. We fit KNN on the training data, then test it on the test set and get the confusion matrix below:
## Observation
## Prediction 0 1
## 0 327 258
## 1 401 1012
The misclassification rate on test data is 32.98%.
## result
## accuracy 0.6701702
## sensitivity 0.7968504
## specificity 0.4491758
## ppv 0.7162067
## npv 0.5589744
## precision 0.7162067
## recall 0.7968504
By comparing the test-set performance of these 4 models, we find that Logistic Regression with lasso (\(\lambda\) chosen by the 1-SE rule) performs best. The table below compares the misclassification rates of the 4 models; Logistic Regression with lasso achieves the lowest.
| Model | Misclassification Rate |
|---|---|
| Logistic Regression with lasso | 11.06% |
| Random Forests | 12.91% |
| Naive Bayes | 20.02% |
| KNN | 32.98% |
The ROC curves of Logistic Regression with lasso (black), KNN (steel blue), and Random Forests (green) also show that Logistic Regression with lasso performs best at predicting whether a customer will unsubscribe from the service in the near future. Compared with the 63.14% prevalence of cancelled customers in our training dataset, the 88.94% accuracy achieved by Logistic Regression with lasso is outstanding, so we can apply this model to future users and devise strategies to reactivate those predicted to cancel their accounts.
In Problem 2, we estimate the life-time value of each customer, meaning the total revenue earned by the company over the course of its relationship with that customer. Since the output is a numerical value, this is a regression problem. We analyze two different groups of customers: all customers, and only those who have already cancelled their accounts. We try three models: Linear Regression with lasso, Regression Tree, and Random Forests. We use 37 features this time: the 36 features listed in Feature Engineering, plus whether the customer has cancelled their account. We calculate ‘ltv’, the customer’s realized life-time value, as the response for this regression problem; for customers who have not cancelled, we compute ltv through 12-31-2014, the end date of the dataset. When comparing the performance of the models, we use Mean Absolute Error (MAE) as our metric, as it directly tells us how far our predictions are from the true values.
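The Mean Absolute Error metric used throughout Problem 2 is simply the mean of the absolute prediction errors. A minimal sketch, with made-up life-time values for illustration:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute distance between predictions and true values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Hypothetical true vs. predicted life-time values for four customers:
print(mean_absolute_error([10.0, 4.0, 7.5, 2.0], [12.0, 3.0, 7.0, 2.5]))  # 1.0
```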
Just like Model 1 in Problem 1, lasso automatically performs feature selection when fitting a regularized linear regression to predict ltv. We choose the \(\lambda\) value using cross-validation, and the CV error plot is as below:
From the plot above, we choose the optimal \(\lambda\) = 0.0115606 using the 1-SE rule; the resulting model selects 29 variables. The variables and their coefficients are in the table below:
| Variable | Coefficient |
|---|---|
| (Intercept) | -3.392 |
| cancelled1 | -0.266 |
| genderM | 1.560 |
| totalLoginNum | 0.202 |
| maxPages | 0.651 |
| minPages | -0.107 |
| sdPages | -1.248 |
| skewPages | 0.520 |
| avgOnsite | 0.074 |
| maxOnsite | 0.003 |
| minOnsite | 0.370 |
| minOnsitePageTime | -2.135 |
| sdOnsitePageTime | -0.542 |
| skewOnsitePageTime | 0.359 |
| avgEntered | -1.129 |
| sumEntered | -0.002 |
| avgCompleted | 2.483 |
| YesHoliday | 0.510 |
| NoHoliday | -0.237 |
| conversion | 1.066 |
| firstMonthLoginNum | -0.027 |
| lastMonthLoginNum | -0.147 |
| oneMonthLoginRatio | -2.675 |
| Q1 | 0.118 |
| Q2 | 0.065 |
| Q3 | 0.074 |
| avgDateDiff | 0.530 |
| maxDateDiff | 0.080 |
| minDateDiff | -0.911 |
| sdDateDiff | -0.343 |
Then we test the model on the test data, and the Mean Absolute Error is 2.43.
Trees are highly interpretable, thus we also try to fit the regression tree on the training dataset for this regression task. We choose the optimal tree size using cross-validation, and the CV error plot is as below:
From the plot above, we get optimal tree size = 8 using 1-SE rule. We prune the tree using complexity parameter chosen by 1-SE rule, and the optimal tree is as below:
Then we test the model on the test data, and the Mean Absolute Error is 3.82.
We fit Random Forests on training data for this regression task, and get the importance of different variables as below:
Then we test the model on the test data, and the Mean Absolute Error is 1.13.
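The random-forest regression fit scored with MAE can be sketched with scikit-learn as a stand-in for the R analysis; the synthetic regression data and hyperparameters below are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer-level regression data (ltv response).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(Xtr, ytr)

# Test-set MAE, the comparison metric used in Problem 2.
print(mean_absolute_error(yte, rf.predict(Xte)))
```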
By comparing the test-set performance of these 3 models, we find that Random Forests performs best. The table below compares the Mean Absolute Error of the 3 models; Random Forests achieves the lowest.
| Model | Mean Absolute Error |
|---|---|
| Linear Regression with lasso | 2.43 |
| Regression Tree | 3.82 |
| Random Forests | 1.13 |
Therefore, Random Forests performs best on predicting the life-time value for all customers.
There are 6,314 customers who have already cancelled their accounts. We use exactly the same three models as we do for all customers.
We choose the \(\lambda\) value for lasso using cross-validation, and the CV error plot is as below:
From the plot above, we choose the optimal \(\lambda\) = 0.0125846 using the 1-SE rule; the resulting model selects 30 variables. The variables and their coefficients are in the table below:
| Variable | Coefficient |
|---|---|
| (Intercept) | -4.582 |
| genderM | 1.278 |
| totalLoginNum | 0.216 |
| maxPages | 0.637 |
| minPages | -0.233 |
| sdPages | -1.165 |
| skewPages | 0.620 |
| avgOnsite | 0.093 |
| maxOnsite | 0.005 |
| minOnsite | 0.542 |
| avgOnsitePageTime | 0.087 |
| maxOnsitePageTime | 0.018 |
| minOnsitePageTime | -3.200 |
| sdOnsitePageTime | -0.879 |
| skewOnsitePageTime | 0.367 |
| sumEntered | -0.016 |
| avgCompleted | 2.140 |
| cmplOverEtr | -0.063 |
| YesHoliday | 0.534 |
| NoHoliday | -0.241 |
| conversion | 0.270 |
| firstMonthLoginNum | -0.006 |
| lastMonthLoginNum | -0.059 |
| oneMonthLoginRatio | -2.088 |
| Q1 | 0.089 |
| Q2 | 0.075 |
| Q3 | 0.082 |
| avgDateDiff | 0.505 |
| maxDateDiff | 0.073 |
| minDateDiff | -0.808 |
| sdDateDiff | -0.306 |
Then we test the model on the test data, and the Mean Absolute Error is 2.04.
We also try to fit the regression tree on the training dataset for customers who have cancelled their accounts. We choose the optimal tree size using cross-validation, and the CV error plot is as below:
From the plot above, we get optimal tree size = 9 using 1-SE rule. We prune the tree using complexity parameter chosen by 1-SE rule, and the optimal tree is as below:
Then we test the model on the test data, and the Mean Absolute Error is 3.61.
We fit Random Forests on training data for customers who have cancelled their accounts, and get the importance of different variables as below:
Then we test the model on the test data, and the Mean Absolute Error is 0.94.
By comparing the test-set performance of these 3 models, we again find that Random Forests performs best. The table below compares the Mean Absolute Error of the 3 models; Random Forests achieves the lowest.
| Model | Mean Absolute Error |
|---|---|
| Linear Regression with lasso | 2.04 |
| Regression Tree | 3.61 |
| Random Forests | 0.94 |
Therefore, Random Forests also performs best at predicting the life-time value of customers who have already cancelled their accounts. Note that the Mean Absolute Error is lower for this group than for all customers: customers who have not cancelled but registered late can have a very small life-time value under our calculation, so excluding them makes the prediction task easier and lowers the error. Random Forests gives impressively low error on this task, which means our client could use this approach to predict the life-time value of future users and forecast revenue more accurately.
In the third problem, we develop a customer segmentation scheme to help our client identify sleeping customers: those who are no longer active but have not yet cancelled their accounts. This is an unsupervised learning task because there are no labels, so clustering is a natural method to shed light on this problem [Reference 7].
Before clustering the customers, we first explore the data. For customers who have not cancelled, we plot the number of days between their last login and 12-31-2014, the end date of the dataset (used as the last day of their relationship with the company). Based on this plot, 15 days looks like a reasonable cutoff for deciding whether a customer is sleeping: if a customer has not logged in for more than 15 days, we identify him/her as a sleeping customer.
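The sleeping-customer rule can be sketched as follows. This is a pandas stand-in for the R code (the toy dates and the `lastLogin`/`sleepDays` column names are illustrative assumptions):

```python
import pandas as pd

end_date = pd.Timestamp("2014-12-31")   # end of the data range

# Toy last-login dates for three open (not cancelled) accounts.
last_login = pd.DataFrame({
    "id": [1, 2, 3],
    "lastLogin": pd.to_datetime(["2014-12-30", "2014-11-01", "2014-12-20"]),
})

# Flag accounts whose last login is more than 15 days before the end date.
last_login["sleepDays"] = (end_date - last_login["lastLogin"]).dt.days
last_login["sleeping"] = last_login["sleepDays"] > 15
print(last_login[["id", "sleepDays", "sleeping"]])   # id 2 has slept 60 days
```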
There are 3,681 customers who have not cancelled their accounts, of whom 841 are identified as sleeping customers.
To determine which features to use in clustering the sleeping customers, we first fit Random Forests to the dataset as a classification task (sleeping vs. not sleeping), and use the resulting variable importance ranking to identify the features most indicative of whether a customer is sleeping. The feature importances are shown below:
From the importance ranking shown above, we choose oneMonthLoginRatio, lastMonthLoginNum, sdDateDiff, avgDateDiff, maxDateDiff, and ltv (the customer life-time value calculated in Problem 2) as highly relevant features for clustering. We then also run a feature clustering to see whether any other features are highly correlated with the number of days each customer has been sleeping.
In the dendrogram above, we draw a horizontal red line at correlation = 0.6 and find that customers’ sleeping days are not strongly correlated with the other features. Finally, we cluster the customers on lastMonthLoginNum, oneMonthLoginRatio, avgDateDiff, sdDateDiff, maxDateDiff, ltv, and sleepDays using hierarchical clustering, scaling all features used in clustering. Inspired by Tal Galili’s hierarchical cluster analysis of the Iris dataset [Reference 2], we produce the clustering heatmap below:
The heatmap shows that customers with large avgDateDiff, sdDateDiff, maxDateDiff, and sleeping days, but low lastMonthLoginNum and oneMonthLoginRatio, are grouped together in the cluster marked in blue. This aligns with our understanding of the customers: those with low activity, indicated by long gaps between logins and few last-month logins relative to their first month, are highly likely to be sleeping. Our chosen clustering features are therefore genuinely associated with a customer’s sleeping status.
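The scaling and hierarchical clustering steps can be sketched with SciPy as a stand-in for the R heatmap code. The random matrix below replaces the real customer features, and the choice of Ward linkage and two clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(0)
# Columns stand in for lastMonthLoginNum, oneMonthLoginRatio, avgDateDiff,
# sdDateDiff, maxDateDiff, ltv, and sleepDays.
X = rng.normal(size=(50, 7))

Xs = zscore(X, axis=0)                  # scale each feature, as in the report
Z = linkage(Xs, method="ward")          # agglomerative hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two groups
print(np.unique(labels))
```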